Abstract.

The idea that the distance among pairs of languages can be evaluated from lexical 
differences seems to have its roots in the work of the French explorer Dumont D'Urville. 
He collected comparative word lists of various languages during his voyages aboard 
the Astrolabe from 1826 to 1829 and, in his work about the geographical division of the 
Pacific, he proposed a method to measure the degree of relation between languages. 

The method used by modern lexicostatistics, developed by Morris Swadesh in 
the 1950s, measures distances from the percentage of shared cognates, which are words 
with a common historical origin. The weak point of this method is that subjective 
judgment plays a significant role. 

Recently, we have proposed a new automated method which is motivated by the 
analogy with genetics. The new approach avoids any subjectivity, and its results can be 
easily replicated by other scholars. The distance between two languages is defined by 
considering a renormalized Levenshtein distance between pairs of words with the same 
meaning and averaging over the words contained in a list. The renormalization, which 
takes into account the length of the words, plays a crucial role, and no sensible results 
can be found without it. 

In this paper we give a short review of our automated method and we illustrate it 
by applying it to the cluster of Malagasy dialects. We show that it sheds new light on 
their kinship relations and also that it provides new information on how the 
settlement of Madagascar took place. 
Introduction 



Swadesh lists [1] have been popular in lexicostatistics for half a century. They are 
lists of words associated with the same M meanings (the original Swadesh choice was 
M = 200), which concern the basic activities of humans. The choice is motivated by the 
fact that these terms, which are learned during childhood, change very slowly over time. 
By comparing the two lists corresponding to a pair of languages, it is possible to determine 
the percentage of shared cognates, which is a measure of their lexical distance. Then, 
provided that vocabularies change at a constant rate, the divergence time can be 
estimated, roughly, from the logarithm of this percentage. Recent examples of the use of 
Swadesh lists and cognate counting to construct language trees are the studies of Gray 
and Atkinson [2] and Gray and Jordan [3]. 

Cognates are words with a common historical origin, and their identification is often a 
matter of sensitivity and personal knowledge. In fact, the task of counting the number 
of cognate words in a list is far from trivial because cognates do not necessarily look 
similar. Therefore, subjectivity plays a significant role. Furthermore, results are often 
biased, since it is easier for European or American scholars to identify cognates 
belonging to western languages. For instance, the Spanish word leche and the Greek 
word gala are cognates. In fact, leche comes from the Latin lac with genitive form lactis, 
while the genitive form of gala is galactos. Also the English wheel and the Hindi cakra 
are cognates. These two identifications were possible because of our historical records; 
they could hardly have been possible for the languages of, let's say, Australia. 

The idea of measuring relationships among languages using vocabulary is much 
older than lexicostatistics, and it seems to have its roots in the work of the French 
explorer Dumont D'Urville. He collected comparative word lists of various languages 
during his voyages aboard the Astrolabe from 1826 to 1829 and, in his work about the 
geographical division of the Pacific [4], he proposed a method to measure the degree of 
relation among languages. He used a core vocabulary of 115 terms which, impressively, 
contains almost all the meanings of the 100-item list of Swadesh. Then, he assigned 
a distance from 0 to 1 to any pair of words with the same meaning, and finally he was 
able to determine the degree of relation between any pair of languages. 

Our automated method [5], [6] has some advantages: the first is that, at variance with 
previous approaches, it avoids subjectivity; the second is that results can be replicated 
by other scholars, assuming that the database is the same; the third is that no 
specific expertise in linguistics is required; and the last, but surely not the least, is 
that it allows for a rapid comparison of a very large number of languages. For any 
language we write down a Swadesh list, then we compare words with the same meaning 
belonging to different languages, considering only orthographic differences. This may 
appear reductive, since words may look similar by chance while cognate words may 
have a completely different orthography, but we will try to convince the reader that 
this is indeed a simpler, more objective and more efficient choice with respect to the 
traditional lexicostatistics approach. This method is motivated by the analogy with 
genetics: the vocabulary has the role of DNA, and the comparison is simply made by 
measuring the differences between the DNA of the two languages. 

If a family of languages is considered, all the information is encoded in a matrix 
whose entries are the pairwise lexical distances. Nevertheless, this information is not 
manifest and it has to be extracted. The typical approach to this problem is to transform 
the matrix information into a phylogenetic tree. However, the tree encodes only 
the information concerning the vertical transmission between languages. Therefore, 
we also propose a complementary geometrical approach which transfers the matrix 
information into language positions in an n-dimensional Euclidean space and which also 
takes into account the horizontal transmission. The method is tested against the cluster 
of Malagasy dialects, and it proves able to uncover important new aspects of their 
internal organization and of their origin. 

Lexical distances 

In order to describe our method, we start with our definition of the lexical distance between 
two words, which is a variant of the Levenshtein distance [7]. The Levenshtein distance 
is simply the minimum number of insertions, deletions, or substitutions of a single 
character needed to transform one word into the other. Our distance is obtained by a 
renormalization. 

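More precisely, given two words $\omega_1$ and $\omega_2$, their distance $d(\omega_1, \omega_2)$ is defined as 

$$ d(\omega_1, \omega_2) = \frac{d_L(\omega_1, \omega_2)}{l(\omega_1, \omega_2)} \qquad (1) $$

where $d_L(\omega_1, \omega_2)$ is their standard Levenshtein distance and $l(\omega_1, \omega_2)$ is the number of 
characters of the longer of the two words $\omega_1$ and $\omega_2$. Therefore, the distance can take 
any value between 0 and 1. 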
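
The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `levenshtein` and `word_distance` are hypothetical, the test strings are made up, and treating two empty strings as identical is an assumption added only to avoid a division by zero.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def word_distance(a: str, b: str) -> float:
    """Levenshtein distance divided by the length of the longer word."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # assumption: two empty strings are treated as identical
    return levenshtein(a, b) / longest


# Made-up strings: one substitution in a 2-character and in an 8-character word.
short_pair = word_distance("ab", "ac")
long_pair = word_distance("abcdefgh", "abcdefgx")
print(short_pair, long_pair)
```

The same single substitution weighs much more in the short pair than in the long one, which is the effect the division by the longer length is meant to produce.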

The reason for the renormalization can be understood from the following example. 
Consider the case of two words with the same length in which a single substitution 
transforms one word into the other. If they are short, let's say 2 characters, they are 
very different. On the contrary, if they are long, let's say 8 characters, it is reasonable 
to say that they are very similar. Without renormalization, their distance would be 
the same, equal to 1, regardless of their length. Instead, introducing the normalization 
factor, in the first case the distance is 1/2, whereas in the second it is much smaller and 
equal to 1/8. 



We use the distance between pairs of words, as defined above, to construct the lexical 
distances between languages. For any language we prepare a list of words associated with the 
same M meanings (we adopt the original Swadesh choice of M = 200). 

Assume that the number of languages is N and that any language in the group is labeled 
by a Greek letter (say $\alpha$) and any word of that language by $\alpha_i$ with $1 \le i \le M$. The 
same index i corresponds to the same meaning in all languages, i.e., two words $\alpha_i$ and 
$\beta_j$ in the languages $\alpha$ and $\beta$ have the same meaning if $i = j$. 

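The lexical distance between two languages is then defined as 

$$ D(\alpha, \beta) = \frac{1}{M} \sum_{i=1}^{M} d(\alpha_i, \beta_i) \qquad (2) $$

It can be seen that $D(\alpha, \beta)$ is always in the interval [0,1] and obviously $D(\alpha, \alpha) = 0$. 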

The result of the analysis described above is an N x N upper triangular matrix 
whose entries are the N(N - 1)/2 non-trivial lexical distances $D(\alpha, \beta)$ between all pairs 
of languages. 
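
As a rough sketch of how the language distance and the resulting table of pairwise distances could be computed, consider the following Python fragment. The names `language_distance` and `distance_matrix` are hypothetical, and the three word lists are synthetic strings invented for illustration (they are not taken from the Malagasy database). The lists are assumed to be aligned, so that position i carries the same meaning in every language.

```python
from itertools import combinations


def levenshtein(a, b):
    """Standard Levenshtein distance, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def word_distance(a, b):
    """Levenshtein distance divided by the length of the longer word."""
    return levenshtein(a, b) / max(len(a), len(b), 1)


def language_distance(list_a, list_b):
    """Average word distance over aligned lists of the same M meanings."""
    if len(list_a) != len(list_b) or not list_a:
        raise ValueError("lists must be non-empty and cover the same meanings")
    return sum(word_distance(x, y) for x, y in zip(list_a, list_b)) / len(list_a)


def distance_matrix(word_lists):
    """Return the N(N-1)/2 non-trivial distances, keyed by language pair."""
    names = sorted(word_lists)
    return {(a, b): language_distance(word_lists[a], word_lists[b])
            for a, b in combinations(names, 2)}


# Synthetic toy data: three "languages" with M = 2 meanings each.
toy_lists = {
    "alpha": ["abcd", "efgh"],
    "beta": ["abcd", "efgx"],
    "gamma": ["abxy", "efgh"],
}
toy_matrix = distance_matrix(toy_lists)
for pair, value in toy_matrix.items():
    print(pair, value)
```

Storing only the pairs returned by `combinations` mirrors the upper triangular form of the matrix: the diagonal is zero and the lower triangle would only repeat the same values.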

It is important to notice that although the matrix of distances encodes all the 
information concerning relationships among the N languages, this information is not 
manifest and it has to be extracted. The typical approach to this problem is to transform 
the matrix into a phylogenetic tree. 

Nevertheless, in this transformation, part of the information may be lost because 
transfer among languages is not exclusively vertical (as in mtDNA transmission from 
mother to child) but can also be horizontal (borrowings and, in extreme cases, 
creolization). 

Another approach is the geometric one that results from Structural Component 
Analysis (SCA), which we have recently proposed [8]. This approach encodes the matrix 
information into the positions of the languages in an n-dimensional space. For large n 
one recovers all the matrix content, but a low dimensionality, typically n = 2 or n = 3, is 
sufficient to grasp all the relevant information. 
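
SCA itself is described in the cited reference and is not reproduced here. To illustrate the general idea of turning a distance matrix into coordinates in a low-dimensional Euclidean space, the sketch below uses classical multidimensional scaling, which is a different, standard technique substituted purely for illustration. The function name `classical_mds`, the power-iteration eigen-solver and the rectangle test data are all assumptions of this example, and the solver assumes that the leading eigenvalues are positive and well separated.

```python
import math
import random


def _matvec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def classical_mds(dist, n_dims=2, iterations=500, seed=0):
    """Embed a symmetric distance matrix in n_dims Euclidean dimensions."""
    n = len(dist)
    # Double-centre the squared distances to obtain a Gram (inner product) matrix.
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    gram = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
             for j in range(n)] for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * n_dims for _ in range(n)]
    for k in range(n_dims):
        # Power iteration converges to the leading eigenvector of gram.
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for _ in range(iterations):
            w = _matvec(gram, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break  # nothing left to extract
            v = [x / norm for x in w]
        eigenvalue = sum(x * y for x, y in zip(v, _matvec(gram, v)))
        scale = math.sqrt(max(eigenvalue, 0.0))
        for i in range(n):
            coords[i][k] = scale * v[i]
        # Deflation: subtract the component just found before the next pass.
        for i in range(n):
            for j in range(n):
                gram[i][j] -= eigenvalue * v[i] * v[j]
    return coords


# Toy check: four corners of a 3 x 4 rectangle (made-up points, not language data).
rectangle = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0), (0.0, 4.0)]
distances = [[math.hypot(p[0] - q[0], p[1] - q[1]) for q in rectangle]
             for p in rectangle]
coords = classical_mds(distances, n_dims=2)
worst = max(abs(math.hypot(coords[i][0] - coords[j][0],
                           coords[i][1] - coords[j][1]) - distances[i][j])
            for i in range(4) for j in range(4))
print(f"largest reconstruction error: {worst:.2e}")
```

Because the toy points really lie in a plane, two dimensions reproduce all the pairwise distances; with real lexical distances a low-dimensional embedding would in general only approximate the matrix.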

Other methods can be used for specific purposes (see [9], [10], [11], [12]), and other 
information can be extracted from the matrix by combining different approaches and/or 
by comparing with other information sources such as, for example, the matrix of geographical 
distances between the homelands of languages [15]. 

We would also like to mention that the method described here was later used and 
developed by another large group of scholars [13]. They placed the method at the core of 
an ambitious project, the ASJP (Automated Similarity Judgment Program), whose 
aim, in the words of its proponents, is "achieving a computerized lexicostatistics analysis 
of ideally all the world's languages". 

Malagasy dialects 

We now demonstrate the method by applying it to the Malagasy dialects, which are 
regional variants of the same language of Indonesian origin. Indeed, the nearest relative 
is Maanyan, which is spoken by a Dayak community in Borneo [17], [18]. A relevant 
contribution also comes from loanwords of other Indonesian languages such as Malay [19] 
and also of African ones [20]. 

The vocabulary was collected by the author with the invaluable help of Joselina 
Soafara Nere at the beginning of 2010. The dataset, which can be found in [14], consists 
of 200-word Swadesh lists for 23 dialects of Malagasy from all the areas of the island. 
Since the number of dialects is N = 23, the output of our method is a matrix with 
N(N - 1)/2 = 253 non-trivial entries representing all the possible lexical distances 
among dialects. 

The information concerning the vertical transmission of vocabulary from proto-Malagasy 
to the contemporary dialects can be extracted by a phylogenetic approach. 
There are various possible choices of algorithm for the reconstruction of the family 
tree (see [15] for a discussion of this point); we show in Fig. 1 the output of the 
Unweighted Pair Group Method Average (UPGMA). The input data for the UPGMA 
tree are the pairwise separation times obtained from the lexical distances by means 
of a simple logarithmic rule ([5], [6]). The absolute time-scale is calibrated by the 
results of the SCA analysis, which indicate a separation date of A.D. 650, as will be 
explained later. The phylogenetic tree in Fig. 1 interestingly shows a main partition of 
the Malagasy dialects into two main branches (east-center-north and south-west), at variance 
with previous studies which gave a different partitioning [21] (indeed, the results in [21] 
coincide with ours if a correct phylogeny is applied; see [15] for a discussion of this 
point). Then, each of the two branches splits, in turn, into two sub-branches whose leaves 
are associated with different colors. In order to demonstrate the strict correspondence of 
this cladistics with the geography, we display a map of Madagascar (Fig. 2) where the 
locations of the 23 dialects are indicated with the same colors as the leaves in Fig. 1. 
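
For readers unfamiliar with UPGMA, the following Python fragment sketches the standard algorithm on a small made-up matrix. It is only an illustration under stated assumptions: the function name `upgma`, the nested-tuple tree format and the toy distances are hypothetical, and the conversion of lexical distances into separation times by the logarithmic rule is not reproduced here, so the input is simply taken as a symmetric matrix of pairwise values.

```python
def upgma(labels, matrix):
    """Build a UPGMA tree from a symmetric matrix of pairwise distances.

    A tree is either a leaf label or a tuple (left, right, height), where
    height is half the average distance between the two merged clusters.
    """
    n = len(labels)
    trees = {i: labels[i] for i in range(n)}  # cluster id -> subtree
    sizes = {i: 1 for i in range(n)}          # cluster id -> number of leaves
    d = {(i, j): matrix[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(trees) > 1:
        # Merge the closest pair of current clusters.
        (a, b), dab = min(d.items(), key=lambda item: item[1])
        merged = (trees.pop(a), trees.pop(b), dab / 2.0)
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        # Distance to every other cluster: average weighted by cluster size.
        averaged = {}
        for c in trees:
            dac = d[(min(a, c), max(a, c))]
            dbc = d[(min(b, c), max(b, c))]
            averaged[c] = (size_a * dac + size_b * dbc) / (size_a + size_b)
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        for c, value in averaged.items():
            d[(c, next_id)] = value  # new ids are always larger than old ones
        trees[next_id] = merged
        sizes[next_id] = size_a + size_b
        next_id += 1
    return next(iter(trees.values()))


# Made-up distances for four hypothetical leaves A, B, C, D.
toy_labels = ["A", "B", "C", "D"]
toy_distances = [
    [0.0, 2.0, 8.0, 8.0],
    [2.0, 0.0, 8.0, 8.0],
    [8.0, 8.0, 0.0, 4.0],
    [8.0, 8.0, 4.0, 0.0],
]
toy_tree = upgma(toy_labels, toy_distances)
print(toy_tree)
```

In this toy case A and B join first, then C and D, and the two pairs are joined last, so the tree has two main branches, each with two leaves.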

Trees are ubiquitous in representations of language taxonomies. Nevertheless, they 
fail to reveal all the information contained in the matrix of lexical distances. The 
reason is that the simple relation of ancestry, which is the single principle behind 
a branching family tree model, cannot account for the complex interactions among 
dialects in real time. Structural Component Analysis (SCA) represents the relationships 
among different languages as positions in an n-dimensional Euclidean space (see [8] for a 
description of the method). For a large number n of dimensions all the information of 
the matrix is recovered; nevertheless, a low dimension (n = 2 or n = 3) is usually sufficient. 

In the case of the Malagasy dialects, it happens that they belong to the same plane and 
therefore a dimension n = 2 is already enough. If one also considers an external language, 
such as Malay or Maanyan (which, like Malagasy, belong to the Austronesian 
family), one finds that it lies on a different plane and, therefore, the minimum 
dimension for a complete description is n = 3. In order to make this comparison it was 
necessary to compute the lexical distances between all the 23 Malagasy dialects and 
the two Indonesian languages. The main source for the Malay and Maanyan vocabularies 
was [16], supplemented and corrected by the authors (our database can be consulted at 
[14]). Any of the points in the 3-dimensional space can be identified by its radial and 
angular coordinates. In Fig. 3 we plot the zenith angle $\theta$ and the azimuth angle $\varphi$ in 
the two cases: 23 dialects + Maanyan and 23 dialects + Malay. The clustering of the angles 
in Fig. 3 confirms the partition of the Malagasy dialects into four groups and, in particular, it confirms 
the relative isolation of the Antandroy variant (yellow) which, nevertheless, belongs to 
the same plane as the other Malagasy dialects, at variance with Malay and Maanyan. 
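
The passage from positions in 3-dimensional space to radial and angular coordinates can be sketched as follows. This is a generic conversion under an assumed convention (zenith angle measured from the z axis, azimuth measured in the x-y plane from the x axis); how the axes are oriented in the SCA representation is not specified here, and the function name `to_spherical` is hypothetical.

```python
import math


def to_spherical(x: float, y: float, z: float):
    """Return (r, theta, phi): radius, zenith angle and azimuth angle in radians."""
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        return 0.0, 0.0, 0.0  # angles are undefined at the origin; zeros by convention
    cos_theta = max(-1.0, min(1.0, z / r))  # clamp to guard against rounding
    theta = math.acos(cos_theta)            # zenith angle, measured from the z axis
    phi = math.atan2(y, x)                  # azimuth angle in the x-y plane
    return r, theta, phi


# Made-up points, one on each coordinate axis.
for point in [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 2.0)]:
    print(point, to_spherical(*point))
```

Plotting only the pair of angles, as in an angular scatter plot, discards the radius, so points at different distances from the origin but in the same direction coincide.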

It is worth mentioning that the distribution of the dialects along the radial direction 
is remarkably heterogeneous, indicating that the rate of change in the vocabulary 
was anything but stable over time. While the angular distribution gives information 
concerning the proximity of languages and the clustering of the family, the radial 
distribution gives information about the date at which the divergence of the dialects started. 
In fact, the variance of the radial distribution is proportional to the time lag from the 
breaking of the unity of the proto-language [8], [15]. If the method is applied to our data, 
it gives a lag of about 1350 years. Presumably, the beginning of divergence coincided 
with the colonization event; therefore, the variance gives the date 650 A.D. for the 
landing of the Indonesian seafarers, a result which agrees with the date proposed by 
other scholars such as Adelaar [18], [19]. 

Finally, we plot the distances of the Malagasy dialects from Maanyan and Malay 
in Fig. 4. We observe that some dialects (Antananarivo, 
Fianarantsoa, Manajary, Manakara) have a smaller distance from both Maanyan and 
Malay. This suggests a scenario according to which there was a migration to the 
highlands of Madagascar (Betsileo and Imerina regions) shortly after the landing on 
the south-east coast (Manakara, Manajary). The same indication of a south-east 
landing comes from the fact that linguistic diversity is higher in that region (see [15]). 
Furthermore, these findings are supported by geographical information: 
an oceanic current from Indonesia links the Sunda strait with the south-east coast of 
Madagascar [15]. 

Discussion and outlook 

The method that we have presented has many advantages, such as speed, objectivity and 
reproducibility; nevertheless, we think that the main advance is that it can be used 
in a kind of blind mode. In this way, it is possible to avoid errors arising 
from preconceptions which may influence the results of more traditional and subjective 
approaches. 

We applied it to the Malagasy dialects, and we found a consistent new representation of 
their phylogeny. We also found a date (650 A.D.) and a place (south-east coast) for the 
founding event. 

Concerning Madagascar, a still unexplained mystery about the migration from 
Indonesia is that the closest language to Malagasy is Maanyan, which is spoken by 
an ethnic group of Bornean Dayaks. The problem is that it is unlikely that the Dayaks 
led the spectacular migration from Kalimantan to Madagascar, since they are forest 
dwellers with river navigation skills only. We plan to investigate this mystery by comparing, 
with our method, the Malagasy dialects with various Indonesian languages. 
Figure 1. Phylogenetic tree of 23 Malagasy dialects obtained by the Unweighted Pair 
Group Method Average (UPGMA). The phylogenetic tree shows a partition of 
the Malagasy dialects into four main groups associated with different colors. The strict 
correspondence of this cladistics with the geography can be appreciated by comparison 
with Fig. 2. 
Figure 2. Geography of Malagasy dialects. The locations of the 23 dialects are 
indicated with the same colors as in Fig. 1. 
Figure 3. Geometry of Malagasy dialects + Malay (left) and Malagasy dialects + 
Maanyan (right). Any dialect or language is associated with a zenith angle $\theta$ and an 
azimuth angle $\varphi$. The radial component is not plotted. The dialects are indicated with 
the same colors as in the map in Fig. 2. 
Figure 4. Lexical distances of Malagasy dialects from Malay and Maanyan. The 23 
dialects are indicated with the same colors as in the map in Fig. 2. 



